Team II Functional Annotation Group: Difference between revisions

Revision as of 09:52, 30 March 2020

Team 2: Functional Annotation

Team Members: Danielle Temples, Courtney Astore, Rhiya Sharma, Ujani Hazra, Sooyoun Oh

Introduction

What is Functional Annotation?

Functional annotation is the practice of assigning biological meaning to coding genes (genes that encode proteins) and their corresponding protein sequences. Such annotations can be derived using homology-based and ab initio approaches, which are explained in the sections that follow.

Objective: Perform a full functional annotation, relevant to C. jejuni, of the genes and proteins predicted by the Gene Prediction group.

Homology Approaches

Determine function via sequence similarity to already functionally annotated sequences
Limited by what we already know.

Ab Initio Approaches

Determine function via predictive model without comparing to existing sequences
Based on general physical and biological principles rather than on known sequences
Difficult to verify without experiments

Data Overview

We received 50 .fna and 50 .faa files from the Gene Prediction group. The 50 .fna files are multifasta nucleotide files, one per genome. The 50 .faa files are multifasta protein files, one per proteome.

Clustering

Significant sequence similarity implies shared ancestry that often leads to shared function
Clustering such sequences can reduce repeat queries in homology-based annotations
Reducing repeats improves speed and storage costs
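To illustrate why clustering saves work, here is a minimal Python sketch, assuming a hypothetical mapping from each representative sequence to its cluster members. Only the representatives are queried, and each annotation is then copied to the other members. The sequence IDs and the annotation string are made up for illustration.

```python
def propagate_annotations(clusters, rep_annotations):
    """Copy each representative's annotation to every member of its cluster.

    clusters: dict mapping representative ID -> list of member IDs
              (the representative itself is included in the list).
    rep_annotations: dict mapping representative ID -> annotation string.
    Clusters whose representative has no annotation stay unannotated.
    """
    member_annotations = {}
    for rep, members in clusters.items():
        if rep not in rep_annotations:
            continue
        for member in members:
            member_annotations[member] = rep_annotations[rep]
    return member_annotations


# Hypothetical example: two clusters, only one representative got a hit.
clusters = {
    "genomeA_gene1": ["genomeA_gene1", "genomeB_gene7", "genomeC_gene3"],
    "genomeA_gene2": ["genomeA_gene2"],
}
rep_annotations = {"genomeA_gene1": "flagellin"}

annotated = propagate_annotations(clusters, rep_annotations)
print(len(annotated), "sequences annotated from", len(rep_annotations), "query hit")
```

The trade-off is that members inherit the representative's annotation even where they differ slightly from it, so the clustering identity threshold controls how safe this transfer is.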

CD-HIT

Widely used program for clustering and comparing protein or nucleotide sequences
Very fast and can handle extremely large databases

./cd-hit -i [input file] -o [output file name]

Clustering reduced 90,305 protein sequences to 7,414 representative clusters.
Clustering reduced 90,305 DNA sequences to 3,989 representative clusters.
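Alongside the representative FASTA file, CD-HIT writes a `.clstr` file listing the members of each cluster. The sketch below shows one way to read it, assuming the usual layout in which each cluster starts with a `>Cluster N` line and the representative's line ends with `*`. The sample text and sequence IDs are invented for illustration.

```python
import re

# Member lines look like: "0\t354aa, >geneA... *"; capture the ID before "...".
ID_PATTERN = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return a dict mapping representative ID -> list of all member IDs."""
    clusters = {}
    rep, members = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster begins: store the previous one, if any.
            if rep is not None:
                clusters[rep] = members
            rep, members = None, []
            continue
        match = ID_PATTERN.search(line)
        if match is None:
            continue
        seq_id = match.group(1)
        members.append(seq_id)
        if line.endswith("*"):
            rep = seq_id
    if rep is not None:
        clusters[rep] = members
    return clusters


# Hypothetical .clstr content.
sample = """>Cluster 0
0\t354aa, >geneA... *
1\t350aa, >geneB... at 99.43%
>Cluster 1
0\t120aa, >geneC... *
"""

parsed = parse_clstr(sample.splitlines())
print(len(parsed), "clusters")
```

For a real run, the same function can be applied to an open file handle, for example `parse_clstr(open("output.clstr"))`, where the file name is a placeholder.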

Homology Methods

eggNOG-mapper

Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups
Functional annotation of large sequence sets by fast orthology assignment, using precomputed clusters and phylogenies from the eggNOG database
Uses HMM-based and DIAMOND-based searches to find the best seed ortholog in eggNOG
Using orthology predictions for functional annotation gives higher precision than traditional homology searches because it avoids transferring annotations from close paralogs

Run eggNOG-mapper on the clustering results against the eggNOG bacteria database.

./emapper.py -i <cluster> --output <result> -d bact -m diamond

VFDB

Virulence Factor DataBase
Provides the structural features, functions, and mechanisms of virulence factors, which allow pathogens to conquer new niches and circumvent host defense mechanisms
BLAST based identification of virulence genes

Download the VFDB core dataset (which includes only genes associated with experimentally verified VFs) and convert it into a BLAST database.

makeblastdb -in VFDB_db -dbtype 'nucl' -out <db_name>

Run BLASTn with the clustering results as the query against the VFDB database.

blastn -db <db_name> -query <cluster> -out <result> -max_hsps 1 -max_target_seqs 1 -outfmt 6 -perc_identity 100 -num_threads 5
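With `-outfmt 6`, BLAST writes one tab-separated line per hit using its twelve default columns. The following sketch reads such output and keeps the highest-scoring hit for each query. The sample rows and subject IDs are hypothetical, not real results.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return a dict mapping query ID -> row (dict) with the highest bit score."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Hypothetical tabular output with two hits for rep1 and one for rep2.
sample = (
    "rep1\tVFG000001\t100.000\t900\t0\t0\t1\t900\t1\t900\t0.0\t1663\n"
    "rep1\tVFG000002\t100.000\t300\t0\t0\t1\t300\t5\t304\t1e-160\t555\n"
    "rep2\tVFG000003\t100.000\t450\t0\t0\t1\t450\t1\t450\t0.0\t832\n"
)

hits = best_hits(io.StringIO(sample))
for query, row in hits.items():
    print(query, row["sseqid"], row["pident"])
```

Because the command above already sets `-max_target_seqs 1` and `-max_hsps 1`, most queries will have a single line. The best-hit step is a safeguard rather than a requirement.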

CARD

Comprehensive Antibiotic Resistance Database
Provides data, models, and algorithms relating to the molecular basis of antimicrobial resistance
Can be used for the analysis of genome sequences using the Resistance Gene Identifier

Install the RGI software using conda and download the CARD database:

wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json

Build a local database:

rgi load --card_json <path to card.json> --local

Run RGI main with protein sequences:

rgi main -i <path to cluster.faa> -o <output_file_name> -t protein --local

Create a tab-delimited file from rgi results:

rgi tab -i <path to output_file_name.json>

MicrobesOnline

Predicts whether pairs of adjacent genes that are on the same strand are in the same operon, based on the intergenic distance between them, whether orthologs of the genes are near each other in other genomes, and their predicted functions.
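As a rough illustration of the first criterion only (same strand and a short intergenic distance), the sketch below groups adjacent genes into candidate operons. This is not MicrobesOnline's actual model, which also uses gene-neighbor conservation and predicted function. The 50 bp threshold, the `Gene` type, and the coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[Gene]]:
    """Group adjacent same-strand genes whose intergenic gap is <= max_gap bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    groups: List[List[Gene]] = []
    for gene in ordered:
        if groups:
            prev = groups[-1][-1]
            gap = gene.start - prev.end - 1  # bases strictly between the genes
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return groups


# Hypothetical gene coordinates.
genes = [
    Gene("a", 1, 900, "+"),
    Gene("b", 920, 1800, "+"),   # 19 bp after a, same strand: joins a
    Gene("c", 2500, 3100, "+"),  # 699 bp after b: starts a new group
    Gene("d", 3110, 3900, "-"),  # close to c but on the opposite strand
]

candidate_operons = [[g.name for g in group] for group in group_by_distance(genes)]
print(candidate_operons)
```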

Download all the operon tables for Campylobacter jejuni.
Use GID to retrieve the protein reference sequences.
Create a database using BLAST:

makeblastdb -in <fasta file> -dbtype prot -out <database>

Query the clustered sequences with the reference protein database:

blastp -query cdhit/faa_rep_seq.faa -db tmp/db_operon -evalue 0.01 -max_target_seqs 1 -max_hsps 1 -outfmt 6 -out tmp/hits_0.01.txt -num_threads 5

Ab Initio Methods

TBBpred

Uses a neural network and SVM-based approach
Predicts the transmembrane beta-barrel regions in a given protein sequence

TMHMM2

Uses a hidden Markov model (HMM) approach
Outputs the number of transmembrane helices predicted, the position of each residue with respect to the cell membrane (outside/inside the cell, transmembrane segment), and optional graphical output visualizing the predicted helix

Input a multifasta file into TMHMM.

tmhmm <input multifasta file> > <output file>
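To count transmembrane proteins from the TMHMM output, one option is to read the per-protein summary lines. The sketch below assumes the default long output format, in which each protein has a header line of the form `# <id> Number of predicted TMHs: <n>`. Check this against the output of your TMHMM version. The protein IDs and values in the sample are hypothetical.

```python
import re

# Matches summary lines such as "# prot1 Number of predicted TMHs:  3".
TMH_LINE = re.compile(r"^#\s+(\S+)\s+Number of predicted TMHs:\s+(\d+)")


def count_tm_helices(lines):
    """Return a dict mapping protein ID -> number of predicted TM helices."""
    counts = {}
    for line in lines:
        match = TMH_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Hypothetical excerpt of TMHMM long-format output for two proteins.
sample = """# prot1 Length: 278
# prot1 Number of predicted TMHs:  3
# prot1 Exp number of AAs in TMHs: 68.7
prot1\tTMHMM2.0\tinside\t     1     6
prot1\tTMHMM2.0\tTMhelix\t     7    29
# prot2 Length: 120
# prot2 Number of predicted TMHs:  0
prot2\tTMHMM2.0\toutside\t     1   120
"""

helix_counts = count_tm_helices(sample.splitlines())
transmembrane = [pid for pid, n in helix_counts.items() if n > 0]
print(len(transmembrane), "of", len(helix_counts), "proteins have predicted TM helices")
```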

PilerCR

Rapid identification and classification of CRISPR repeats
High sensitivity and high specificity

Input a multifasta file into PilerCR.

pilercr -in <input multifasta file> -out <output file> -noinfo -quiet

SignalP

Uses a deep neural network algorithm
Predicts presence of signal peptides and location of cleavage sites
Does not determine lipoproteins

Input a fasta file into SignalP.

signalp -fasta <input_sequence_file> -org gram- -format short -gff3

Initial Pipeline

Results

Clustering

Ab Initio

Transmembrane Proteins

Minimum = 810
Maximum = 1,049
Average = 928.8

Signal Peptides

Minimum = 71
Maximum = 99
Average = 94.3

Lipoproteins

Minimum = 43
Maximum = 67
Average = 52.8

CRISPR

Minimum = 0
Maximum = 2
Average = 0.8

Homology

Antibiotic Resistance

Minimum = 1
Maximum = 10
Average = 3.8

Virulence Factors

Total = 132
Minimum = 1
Maximum = 13
Average = 3.1

Operons

Minimum = 505
Maximum = 856
Average = 678.1

eggNOG-mapper Output

eggNOG-mapper provides useful functional annotations, including the seed eggNOG ortholog, seed ortholog e-value and score, predicted taxonomic group, predicted protein name, GO terms, KEGG KO, KEGG pathways, eggNOG OG, COG functional category, free-text descriptions, and many more. The descriptions of the COG Functional Categories are shown below.

eggNOG-mapper Results

Total = 44,098
Minimum = 669
Maximum = 1,087
Average = 881.9
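The minimum, maximum, and average values in this section can be computed from a list of counts. The sketch below assumes each statistic is taken over per-genome counts, which the page does not state explicitly. The counts shown are hypothetical, not the team's data.

```python
from statistics import mean


def summarize(counts):
    """Return the total, minimum, maximum, and average (1 decimal) of the counts."""
    return {
        "Total": sum(counts),
        "Minimum": min(counts),
        "Maximum": max(counts),
        "Average": round(mean(counts), 1),
    }


# Hypothetical number of hits found in each of five genomes.
counts = [0, 1, 2, 1, 0]
summary = summarize(counts)
for label, value in summary.items():
    print(f"{label} = {value}")
```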

Final Pipeline

[Figure: Final Pipeline]

References
