Team II Gene Prediction Group

From Compgenomics 2020

Team 2: Gene Prediction

Team Members: Danielle Temples, Kara Lee, Paarth Parekh, Shuting Lin

Introduction

Our Project

Purpose: Investigate an unknown outbreak pathogen using raw genome sequence data from Centers for Disease Control and Prevention (CDC) foodborne illness surveillance outbreak investigations

Overall Objective: Identify and characterize the pathogenic organism, make recommendations for outbreak control, and build a public webserver that automates the computational steps

Our Objective: From assembled genomes, predict genes or features using different prediction methods and evaluate selected tools on their accuracy and performance

What is Gene Prediction?

Identification of the regions of genomic DNA that encode genes. A gene is a segment of DNA that encodes a functional molecule:

  • Protein-coding genes
  • RNA genes
  • May also include other functional elements (e.g., regulatory regions)

Prokaryotic Genome

Prokaryotic genomes have a high gene density and do not contain introns in their protein-coding regions. Candidate protein-coding genes are found as Open Reading Frames or “ORFs”: stretches of sequence that begin with a start codon and end with a stop codon.

Prediction of prokaryotic genes tends to be relatively simple because ORFs are contiguous. However, overlapping ORFs and short genes can cause issues. Each protein-coding gene is an ORF, but not every ORF is a gene.

ORF
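The ORF definition above can be illustrated with a minimal sketch. It assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, scans only the three forward reading frames, and uses a hypothetical `min_codons` length filter; real gene finders also handle the reverse strand and alternative start codons.

```python
# Minimal forward-strand ORF scan (assumption: standard start/stop codons only).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. `min_codons` is a
    hypothetical length filter counted in codons, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATGACC"
orfs = find_orfs(example)
```

In this toy sequence the scan reports one ORF, `ATGAAATGA`, in the third frame. The second ATG (in another frame) has no in-frame stop codon and is not reported, and an ORF that is reported is still only a candidate gene.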

Homology Methods

  • Make predictions by comparing sequences with previously known genes
  • Based on extrinsic information
  • Can be used to validate/support Ab Initio findings
  • Limited to genes similar to known sequences, so novel genes cannot be found

BLAST

Useful for identifying species, locating domains, establishing phylogeny, DNA mapping, and sequence comparison.

Algorithm:

  1. Break the query into words of length W
  2. Align the words against the sequences in the database & identify matches
  3. Score the matches and keep those that reach the threshold T
  4. Extend each match in both directions until the score falls below a cutoff, producing high-scoring segment pairs (HSPs)
  5. Report HSPs that meet or exceed the BLAST cutoff for statistical significance

BLAST Algorithm
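The seed-and-extend idea in the steps above can be sketched as a toy example. This is not the BLAST implementation: it assumes exact word matches instead of scored neighborhood words, ungapped extension with a simple drop-off rule, and a raw `min_score` cutoff in place of the statistical significance test. The function names and all parameter values are illustrative assumptions.

```python
def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk from (qi, sj) in direction `step`; return (best gain, columns kept)."""
    score = best = 0
    kept = n = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > x_drop:  # score fell too far below the best seen
            break
        qi += step
        sj += step
    return best, kept


def seed_and_extend(query, subject, w=3, match=1, mismatch=-1,
                    x_drop=2, min_score=5):
    """Toy ungapped search; returns (score, q_start, s_start, length) tuples."""
    # Index every length-w word of the subject (the "database").
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    hsps = set()
    for i in range(len(query) - w + 1):
        # Exact word hits stand in for words scoring at least T.
        for j in index.get(query[i:i + w], []):
            right, r_len = _extend(query, subject, i + w, j + w, 1,
                                   match, mismatch, x_drop)
            left, l_len = _extend(query, subject, i - 1, j - 1, -1,
                                  match, mismatch, x_drop)
            score = w * match + left + right
            if score >= min_score:  # stand-in for the significance cutoff
                hsps.add((score, i - l_len, j - l_len, l_len + w + r_len))
    return sorted(hsps, reverse=True)


hits = seed_and_extend("GATTACA", "TTGATTACATT")
```

Several seeds on the same diagonal extend to the same segment pair, so collecting results in a set reports it once.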

GHOSTZ

A faster homology search method that uses database subsequence clustering.

Algorithm:

  1. Subsequences are extracted from the database & similar ones are clustered
  2. Hash tables are constructed from the clusters
  3. The hash tables are used to select alignment seeds from the representative sequences of the clusters
  4. The distance between a query subsequence and each cluster representative is calculated
  5. Lower bounds on the distance between the query subsequence and the cluster members are calculated
  6. Similarity Filtering – if the computed lower bound is less than or equal to the distance threshold, the search continues with the cluster members

GHOSTZ Algorithm Similarity Filtering
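The similarity filtering step can be illustrated with a small sketch. It assumes Hamming distance between equal-length subsequences as the metric and a lower bound taken from the triangle inequality, |d(q, r) − d(r, m)| ≤ d(q, m); the function names and the example threshold are hypothetical, and GHOSTZ itself may define the distance and bound differently.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def filter_cluster(query, representative, members, threshold):
    """Keep members whose distance lower bound does not exceed `threshold`.

    Only the query-to-representative distance is computed at search time;
    representative-to-member distances could be stored when clustering.
    """
    d_qr = hamming(query, representative)
    kept = []
    for member in members:
        d_rm = hamming(representative, member)
        lower_bound = abs(d_qr - d_rm)  # triangle inequality
        if lower_bound <= threshold:
            kept.append(member)
    return kept


query = "AAAAAA"
representative = "AAAATT"
members = ["AAAATT", "AAAAAT", "CCCCTT", "CCCCCC"]
kept_strict = filter_cluster(query, representative, members, threshold=1)
kept_loose = filter_cluster(query, representative, members, threshold=2)
```

Passing the filter does not guarantee a close match: with the looser threshold, `CCCCTT` is kept even though its true distance to the query is 6. The bound only rules out members that cannot be within the threshold, so the survivors still need a full comparison.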

Ab Initio Methods

  • Inspect the input sequence and search for signs that a gene is present
  • Simplest method is to inspect ORFs
  • Rely on probability models & specific DNA motifs (signals)
  • Common techniques include Markov Models and Dynamic Programming
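As a rough illustration of how a probability model can separate coding from non-coding sequence, the sketch below trains two first-order Markov chains and scores a sequence by their log-odds ratio. The training sequences are made-up toy data, the names `train_markov` and `log_odds` are hypothetical, and the tools listed below use more elaborate models than this.

```python
import math

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev).

    A pseudocount keeps unseen transitions from getting zero probability.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


def log_odds(seq, coding, background):
    """Sum of log(P_coding / P_background) over adjacent base pairs."""
    return sum(
        math.log(coding[a][b] / background[a][b])
        for a, b in zip(seq, seq[1:])
    )


# Toy training data (assumption: invented sequences, not real genes).
coding_model = train_markov(["ATGGCGGCGGCGTAA", "ATGGCCGCGGCCTGA"])
background_model = train_markov(["ATATTTAATATAAATT", "TTTAAATATATTAAAT"])

gc_rich_score = log_odds("GCGGCG", coding_model, background_model)
at_rich_score = log_odds("ATATAT", coding_model, background_model)
```

A positive score means the sequence looks more like the coding training set than the background; here the GC-rich string scores above zero and the AT-rich string below zero.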

GeneMarkS-2

PRODIGAL

Glimmer3

Tool Evaluation

Non-Coding RNA Methods

  • Non-coding RNA (ncRNA) - an RNA molecule that is not translated into a protein
  • Includes transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), and small RNAs (sRNAs)
  • Involved in protein synthesis/translation (tRNA and rRNA) & gene regulation (sRNA)
  • Some ncRNAs are related to antibiotic resistance

ARAGORN

  • Homology-based tool
  • Uses heuristic algorithms that score tRNA and tmRNA genes based on their sequence and secondary structure similarities
  • An effective tRNA search program, with reported sensitivity better than other heuristic tRNA search algorithms available at the time

RNAmmer

  • Ab Initio-based tool
  • Uses Hidden Markov Models trained on data from the 5S rRNA database
  • Fast with little loss of sensitivity, enabling the analysis of a complete genome in less than a minute
  • Locations of rRNAs can be predicted with very high accuracy

Infernal

  • Implementation of covariance models (CMs)
  • RNA homology search based on accelerated profile HMM methods and HMM-banded CM alignment methods
  • About 100-fold faster than the previous version of Infernal and ∼10,000-fold faster than exhaustive non-filtered CM searches

Results

Final Pipeline

References