Team I Functional Annotation Group: Difference between revisions

From Compgenomics 2020
Hcheon6 (talk | contribs)
Revision as of 22:47, 28 March 2020

Team 1 Functional Annotation

Team members: Kenji Gerhardt, Maria Ahmad, Manasa Vegesna, Shuheng Gan, and Hyeonjeong Cheon

Introduction

Background

Functional annotation is the process of associating predicted genes with their functional roles in a cell. This includes the type of gene (translated/untranslated), the location of its product within the cell, and its chemical and biological roles. These can be derived either by comparison to similar, already annotated genes with known (or anticipated) functions, or through ab initio annotation, which relies on existing models but not on databases of other annotated genes.

Objective

Our goal in this step is to functionally annotate the genes supplied to us by the gene prediction group.

Data

The genomes supplied to us were identified during assembly as E. coli. E. coli is the most highly studied microorganism: at present, NCBI contains 1072 complete E. coli genomes in its genome database, along with thousands of additional chromosomes, contigs, and scaffolds. This depth of study extends to its genes, and multiple pathogenic strains have been identified and well characterized.

Pipeline

Pipeline image will be inserted

Clustering

Methods

Ab-initio Approach

Ab initio tools predict and annotate different regions of the prokaryotic genome using 1) sequence composition, 2) likelihoods within the gene models, 3) gene content, and 4) signal detection. The ab initio approach can find new genes and needs no external data or evidence for prediction. However, it is limited by false positives in the predicted data and by over-prediction of small genes.
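To make "signal detection" concrete, here is a minimal sketch that scans the N-terminal region of a protein for a hydrophobic stretch, using only the sequence itself. It is a toy heuristic for illustration, not the method used by any of the tools on this page. The function names, the window size, the 30-residue region, and the 2.0 threshold are all assumptions. The hydropathy values are the standard Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_window_hydropathy(protein, window=8, region=30):
    """Highest mean hydropathy of any window in the N-terminal region.

    Returns None if the region is shorter than one window.
    Unknown residues (e.g. X) score 0.0; this is an assumption.
    """
    head = protein[:region].upper()
    if len(head) < window:
        return None
    scores = [KD.get(aa, 0.0) for aa in head]
    best = None
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if best is None or mean > best:
            best = mean
    return best


def looks_like_signal(protein, threshold=2.0, window=8, region=30):
    """Toy call: True if the N-terminus has a strongly hydrophobic window."""
    best = max_window_hydropathy(protein, window=window, region=region)
    return best is not None and best >= threshold
```

A sequence-only check like this needs no database, which is the strength of the ab initio approach. It will also flag any hydrophobic N-terminus, which is one way false positives arise.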

Homology-based Tools

Homology between genes means they share ancestry. Homologous genes that have recently diverged usually share function. We aim to transfer annotations from known genes to our predicted genes by finding homologous genes. Gene databases are collections of annotated genes. When we search a gene against a database, the search looks for homology between our gene sequences and those in the database, which lets us infer our genes' likely functions.

Homology-based tools are generally more accurate and reliable than ab initio tools when close, well-annotated homologs exist, and can be targeted for specific purposes, e.g. antibiotic resistance genes. However, their results depend on the quality of existing annotations and on which databases are searched.
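The idea of transferring annotation from a database hit can be sketched as follows. This assumes search results in the standard 12-column BLAST tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name, the e-value cutoff, and all IDs and descriptions in the usage example are hypothetical, and this is not the logic of any specific tool named on this page.

```python
def transfer_annotations(hit_lines, subject_annotations, max_evalue=1e-5):
    """Assign each query the annotation of its best-scoring database hit.

    hit_lines: iterable of tab-separated lines in 12-column BLAST tabular layout.
    subject_annotations: dict mapping subject ID -> functional description.
    Queries with no hit passing the e-value cutoff are left out of the result.
    """
    best = {}  # query ID -> (subject ID, bit score)
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # too weak to support homology
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    # Hits to unannotated subjects fall back to a generic label (an assumption).
    return {
        query: subject_annotations.get(subject, "hypothetical protein")
        for query, (subject, _) in best.items()
    }


# Toy input; every ID and description here is made up for illustration.
hits = [
    "gene_1\tdbA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "gene_1\tdbB\t60.0\t180\t72\t2\t5\t184\t3\t182\t1e-20\t95",
    "gene_2\tdbC\t35.0\t50\t32\t1\t1\t50\t10\t59\t0.5\t22",
]
annotations = {"dbA": "beta-lactamase", "dbB": "hydrolase"}
result = transfer_annotations(hits, annotations)
```

Here `gene_1` takes the annotation of its highest-scoring hit, while `gene_2` stays unannotated because its only hit fails the cutoff. The sketch also shows the database dependence noted above: a gene can only receive a function that some database entry already carries.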

Prokka

Command:

prokka <input_file> --outdir <output directory> --prefix <file prefixes> --kingdom Bacteria --locustag <locus tag prefix> --norrna --notrna
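When the same Prokka call has to be repeated for many genomes, it can help to build the argument list in one place. The sketch below only assembles the command from the flags shown above and does not run it. The function name and the sample paths are hypothetical, and it assumes `--locustag` takes a prefix value.

```python
import shlex


def build_prokka_command(input_fasta, outdir, prefix, locustag):
    """Assemble a Prokka argument list using only the flags shown above."""
    return [
        "prokka",
        "--outdir", outdir,
        "--prefix", prefix,
        "--kingdom", "Bacteria",
        "--locustag", locustag,  # assumed to take a prefix value
        "--norrna",
        "--notrna",
        input_fasta,
    ]


# Hypothetical sample name and paths, for illustration only.
cmd = build_prokka_command("sample1.fasta", "prokka_out/sample1", "sample1", "S1")
printable = shlex.join(cmd)  # shell-safe string for logging (Python 3.8+)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Prokka is installed. Using a list rather than a single string avoids shell quoting problems with unusual file names.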

EggNOG

Command:

python emapper.py -i <input_file> --output <output_file> -m <diamond|hmmer> -d <database_name> --output_dir <output directory>

Interproscan

DeepARG

Results

Reference