Team I Genome Assembly Group

From Compgenomics 2020
==''' Team 1 Genome Assembly '''==


==== Team members: Lawrence McKinney, Laura Mora, Jessica Mulligan, Heather Patrick, Devishi Kesar, and Hyeonjeong Cheon ====
=='''Introduction and Objectives'''==
In bioinformatics, genome sequence assembly is the first of many steps needed to identify and characterize an organism. It may be considered the most important of the stages of analysis and interpretation (see below), because challenges persist in producing high-quality genome assemblies and annotations [4].
The basic principle of genome assembly is that the greater the similarity between the end of one read and the beginning of another, the more likely the two reads are to have originated from overlapping stretches of the genome. The output of an assembly is typically a set of "contigs", which are contiguous sequence fragments, ordered and oriented into "scaffold" sequences, with gaps between contigs within scaffolds representing regions of uncertainty. Assembly problems fall into numerous subclasses that can be distinguished by, among other things: (1) the nature of the reads, (2) the types of sequences being assembled, and (3) the availability of homologous (related) and previously assembled sequences, such as a reference genome or the genome of a closely related species.
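The overlap principle described above can be illustrated with a short Python sketch. It is a minimal illustration only: the function name <code>overlap_length</code>, the <code>min_overlap</code> parameter, and the toy reads are assumptions made for this example, and real assemblers tolerate sequencing errors and use far more efficient data structures.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


# Toy reads (hypothetical data, not taken from our isolates).
reads = ["ATGGCGTGCA", "GTGCATTCGA", "TTCGAGGCTA"]

for a, b in [(0, 1), (1, 2), (0, 2)]:
    print(reads[a], reads[b], overlap_length(reads[a], reads[b]))
```

In this toy set the first and second reads share a 5-base overlap, as do the second and third, while the first and third share none, which suggests the order in which the reads would be joined.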
Using the most relevant, high-quality tools is important for comparative genomics, because the results can have public health implications that affect many lives. For our project, we are tasked with identifying the source of a given food-borne illness from 50 isolates in FASTQ format that were sequenced using Illumina technology. We are also tasked with developing a workflow for our analysis using publicly available bioinformatic tools that we deem appropriate for identifying our given organism.
==== Stages of analysis and interpretation of data ====
#[http://compgenomics2020.biosci.gatech.edu/Team_I_Genome_Assembly_Group genome assembly]
#[https://compgenomics2020.biosci.gatech.edu/index.php?title=Team_I_Gene_Prediction_Group&action=edit&redlink=1 gene prediction]
#[http://compgenomics2020.biosci.gatech.edu/index.php?title=Team_I_Functional_Annotation_Group&action=edit&redlink=1 functional annotation]
#[http://compgenomics2020.biosci.gatech.edu/index.php?title=Team_I_Comparative_Genomics_Group&action=edit&redlink=1 comparative genomics]
#[http://compgenomics2020.biosci.gatech.edu/index.php?title=Team_I_Webserver_Group&action=edit&redlink=1 production of a predictive webserver]
==== Overview of Genome Assembly ====
[[File: GenomeAssemblyWorkflow.png|Genome Assembly Overview| 600x600px]]
'''Figure 1'''. General Genome Assembly Workflow (Image retrieved from [https://www.1010genome.com/genome-assembly 1010genome])
----
==== Team Goals ====
1. '''To perform quality control on reads before and after assembling the genome:'''
'''Before:'''
** [https://github.com/OpenGene/fastp Fastp]
'''After:'''
** [https://github.com/ablab/quast QUAST]
** [https://www.sanger.ac.uk/science/tools/reapr REAPR] (NOTE: REAPR was not fully evaluated as a viable quality control tool due to time constraints)
2. '''To evaluate the performance of assembly tools:'''
** [https://www.bcgsc.ca/resources/software/abyss ABySS]
** [https://galaxyproject.github.io/training-material/topics/assembly/tutorials/unicycler-assembly/tutorial.html Unicycler]
** [https://software.broadinstitute.org/allpaths-lg/blog/ ALLPATHS-LG]
** [https://github.com/ablab/spades SPAdes]
** [https://github.com/ncbi/SKESA SKESA]
** [https://www.psc.edu/user-resources/software/masurca MaSuRCA]
** [https://www.ebi.ac.uk/~zerbino/velvet/ Velvet]
3. '''To use the best tool to perform de novo assembly based on the 50 isolates.'''
4. '''To send off the highest quality result to the gene prediction team.'''
** [https://compgenomics2020.biosci.gatech.edu/index.php?title=Team_I_Gene_Prediction_Group&action=edit&redlink=1 Team 1 | Gene Prediction]
== '''Methods''' ==
=== Proposed Genome Assembly Pipeline ===
[[File:Genome Assembly Pipeline 1.png| 700x800px|border]]
'''Figure 2.''' A proposed pipeline for genome assembly based upon class lectures and review of literature.
=== Raw Sequence Data ===
Our team was assigned 50 isolates sequenced using an Illumina sequencer. All isolates are in FASTQ format, a standard format that stores each read sequence together with a quality score for every base.
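As a minimal sketch of what a FASTQ record holds, the Python snippet below reads four-line records and decodes the per-base quality string. It assumes Phred+33 quality encoding (the usual encoding for recent Illumina output, but not confirmed for our data) and records that are not wrapped across lines. The function name <code>read_fastq</code> and the example record are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line, which this sketch treats as the end)
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Hypothetical single-record example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Here the quality characters <code>I</code>, <code>5</code>, and <code>!</code> decode to Phred scores 40, 20, and 0 respectively.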
=== Quality Control & Trimming ===
It is important to check read quality before proceeding to assembly, including the average base quality score per read, the GC content distribution, and the most duplicated reads. Assembling reads without first checking their quality can lead to misinterpretation of results. For our project we will use [https://github.com/OpenGene/fastp fastp] for both quality control analysis and read trimming. We chose fastp because it includes most features of similar quality control and trimming tools (FastQC, Cutadapt, Trimmomatic, and AfterQC) while running 2 to 5 times faster than any of them alone [12].
:::::::::::::::::::::::* Our threshold Minimum quality score: '''20'''
:::::::::::::::::::::::* Our Sliding window: '''10'''
:::::::::::::::::::::::* We will trim reads from both ends
[[File: Fastp_3.png|center|600x600px|border]]
:::::::::::::::'''Figure 3'''. Fastp output: Optimal sliding window and minimum quality score parameters for our analysis
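The trimming parameters above (sliding window of 10, minimum quality of 20, trimming from both ends) can be illustrated with a small Python sketch. This is not fastp's implementation. It is a simplified illustration of the sliding-window idea, and the function name <code>sliding_window_trim</code> and the rule of discarding reads shorter than one window are assumptions made for this example.

```python
from typing import List, Tuple


def sliding_window_trim(
    qualities: List[int], window: int = 10, min_mean: float = 20.0
) -> Tuple[int, int]:
    """Return (start, end) such that qualities[start:end] is the part of the read kept.

    A window slides inward from each end, one base at a time, and stops at the
    first window whose mean quality reaches `min_mean`. (0, 0) means discard.
    """
    n = len(qualities)
    if n < window:
        return (0, 0)  # assumption: reads shorter than one window are discarded

    def window_mean(begin: int) -> float:
        return sum(qualities[begin:begin + window]) / window

    # Trim from the front (5' end).
    start = 0
    while start + window <= n and window_mean(start) < min_mean:
        start += 1
    if start + window > n:
        return (0, 0)  # no window passed the threshold

    # Trim from the tail (3' end); the window at `start` passes, so this terminates.
    end = n
    while window_mean(end - window) < min_mean:
        end -= 1
    return (start, end)


# Hypothetical read: poor quality at both ends, good quality in the middle.
example_qualities = [2] * 5 + [35] * 20 + [2] * 5
print(sliding_window_trim(example_qualities))
```

Because the decision is based on the window mean rather than on single bases, a few low-quality bases can remain at each end of the kept region, which is the expected behaviour of window-based trimming.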
=== Assembly ===
An important decision at the assembly stage is whether to assemble against a reference genome or to assemble de novo. Reference-based assembly aligns the reads to a known genome. De novo assembly joins overlapping reads into contiguous sequences called contigs to build a novel genome without a reference. Each approach has pros and cons. Reference-based assembly is relatively fast and works well for detecting single nucleotide variants (SNVs) and small indels. However, it is limited in the length of the features it can detect and requires a reference that is close to the sequenced data. De novo assembly works well for completely new sequences that are not present in a reference. However, it can be slow, has high infrastructure requirements, and is bad at hiding limitations in the raw data.
It is important to understand the characteristics of the biological data you are working with before choosing a bioinformatic tool. The genome used for our project is not human and is likely prokaryotic (based on its base pair count). Unlike the human genome, which is about 99.9% identical between individuals, bacterial and viral genomes can change relatively quickly, so assembly against a reference genome is not ideal because information can be lost during mapping. For this reason, we will assemble the sequence reads using a de novo approach.
[[File:Reference_assembly.jpg|500x500px|center]]
[[File: DenovoAssembly.png|550x550px|center]]
::::::::::::::::::::'''Figure 4.''' Reference genome assembly diagram (top). De novo genome assembly diagram (bottom).
==== de Novo Assembly ====
De novo assembly is the most common type of genome assembly for short-read sequences. It involves reconstructing the entire genome from overlapping sequence reads. The quality depends on the length of the reads and the number of gaps between them. Most tools build on either de Bruijn graphs (solved as an Eulerian path problem) or overlap graphs (solved as a Hamiltonian path problem). De novo assembly can generate new and accurate reference sequences, even for complex genomes.
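To make the de Bruijn graph idea concrete, the following Python sketch builds a graph whose nodes are (k-1)-mers and whose edges are the distinct k-mers found in the reads, then walks an unbranched path to spell out a contig. It is a toy illustration under simplifying assumptions (error-free reads, no repeats, a single strand). The function names <code>de_bruijn_graph</code> and <code>walk_unbranched</code> are hypothetical, and none of the assemblers we evaluated is claimed to work exactly this way.

```python
from collections import defaultdict
from typing import Dict, List, Set


def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it connects to."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # one edge per distinct k-mer
    return dict(graph)


def walk_unbranched(graph: Dict[str, Set[str]], start: str) -> str:
    """Follow edges from `start` while each node has exactly one successor."""
    sequence = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break  # stop on a cycle instead of looping forever
        seen.add(node)
        sequence += node[-1]
    return sequence


# Two hypothetical reads that overlap by three bases.
toy_reads = ["ATGGC", "GGCGT"]
toy_graph = de_bruijn_graph(toy_reads, k=3)
print(walk_unbranched(toy_graph, "AT"))
```

With these two reads and k = 3, the walk starting at <code>AT</code> spells out <code>ATGGCGT</code>, the sequence that contains both reads.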
===Qualitative evaluation of genome assembly tools ===
==== Rationale for (Assembler) tool selection ====
Tool selection was based on literature that provides an unbiased assessment of the pros and cons of various genome assembly tools. (NOTE: Not all pros/cons are displayed.)
** [https://genome.cshlp.org/content/early/2012/01/12/gr.131383.111.full.pdf+html GAGE (Genome Assembly Gold-standard Evaluations)]
*** A study designed to provide a snapshot of how the latest genome assemblers compare on a sample of large-scale next-generation sequencing projects.
*** Helps to answer questions like: Which assembly software will produce the best results?
** [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227110/ Assemblathon I]
*** A critical, competitive assessment of de novo short-read assembly methods.
*** Aims to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies.
{| class="wikitable"
|-
!colspan="1"|Tool
!colspan="1"|Pros
!colspan="1"|Cons
|-
|ABySS
|Uses a distributed k-mer hash table, making it more RAM-efficient
|Produces low contig and scaffold N50 values
|-
|ALLPATHS-LG
|High contig N50 value
|Speed is slow
|-
|SPAdes
|Large contigs
|Generates small contigs if coverage is low
|-
|SKESA
|High sequence quality and contiguity
|Does not have a built-in scaffolding tool
|-
|Velvet
|Great for sequences rich in repeat segments
|Small N50 contig size
|-
|Unicycler
|Assembles larger contigs with fewer misassemblies
|Runtime is long
|-
|MaSuRCA
|Usable for large genomes, High NGA50 contig size
|Speed is slow
|-
|}
'''Table 1.''' Qualitative analysis of select genome assembly tools.
==== Assembly Validation Criteria ====
The correct answer to an assembly problem is generally unknown, so common methods for assessing assembly quality are needed. Below are our evaluation criteria for determining the best genome assembly tool for our project.
{| class="wikitable"
|-
!colspan="1"|'''Metric'''
!colspan="1"|'''Description'''
|-
|N50
|The length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length.
|-
|L50
|The smallest number of contigs whose summed length is at least 50% of the total assembly length.
|-
|Number of Contigs
|The total number of contigs in the assembly
|-
|Largest Contig
|The length of the largest contig in the assembly
|-
|Total Length
|The total number of bases in the assembly
|}
'''Table 2.''' Evaluation criteria of genome assembly tools.
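The N50 and L50 metrics in Table 2 can be computed directly from a list of contig lengths. The Python sketch below shows one way to do so. The function name <code>n50_l50</code> and the example lengths are hypothetical, and in our project these values were reported by QUAST rather than by this code.

```python
from typing import Iterable, Tuple


def n50_l50(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running_total += length
        if running_total >= half_total:
            return length, count  # N50 is this contig's length, L50 is the count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (total 290, so the 50% threshold is 145).
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50_l50(example_lengths))
```

In this example the two longest contigs (80 + 70 = 150) are the first to cover at least half of the assembly, so N50 is 70 and L50 is 2. A higher N50 and a lower L50 generally indicate a more contiguous assembly.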
==== QUAST ====
QUAST is a quality assessment tool for genome assemblies. It can evaluate assembly quality with or without a reference genome, which makes it useful for new species that do not yet have a finished reference genome [1].
== '''Results''' ==
=== Quantitative Comparison of Genome Assemblers ===
Using our evaluation criteria, 5 randomly selected isolates (CGT1294, CGT1893, CGT1966, CGT1977, and CGT1990), representing 10% of our 50 total isolates, were used to determine which genome assembly tool performed best. The contigs.fa files produced by each tool were assessed using QUAST. The best-performing tool was then used to assemble the remaining 45 isolates.
==== Number of Contigs ====
[[File: Number_of_contigs_new.png|700x900px]]
'''Figure 5.'''
----
==== Largest Contigs ====
[[File: Largest_contig_new.png|700x900px]]
'''Figure 6.'''
----
==== Total Length ====
[[File: Total_length_new.png|700x900px]]
'''Figure 7.'''
----
==== N50 ====
[[File: N50_new.png|700x900px]]
'''Figure 8.'''
----
==== L50 ====
[[File: L50_new.png| 700x900px]]
'''Figure 9.'''
----
=== Assemblers Eliminated from Consideration ===
** Velvet -
** ABySS -
** SPAdes -
** SKESA -
=== Final Assemblers ===
** MaSuRCA
** Unicycler
== Identification of Organism ==
[[File: E.coli.jpg|400x400px]]
'''Figure 10.''' Image of ''E. coli'' | Image retrieved from: ([https://health.clevelandclinic.org Cleveland Clinic])
Identity of organism: ''E. coli''
**Genome size:
**Average GC content:
**Disease state: Symptoms vary from person to person. They often include severe stomach cramps, diarrhea (often bloody), and vomiting. Some people may have a fever, which usually is not very high (less than 101˚F/38.5˚C). Most people get better within 5 to 7 days. Some infections are very mild, but others are severe or even life-threatening [4].
===== National Enteric Disease Surveillance =====
[[File: EcoliOutbreak.png|500x500px]]
*''The “Total” category includes culture-confirmed infections of serogroup O157, non-O157 serogroups, and rough isolates.''
*''The “Rough” category includes isolates with an O antigen that could not be determined because the strain autoagglutinated (agglutinated in all antisera and diluent). Strains behaving in this manner are often blocked in one or more steps of O antigen synthesis and typically appear flat with irregular edges when grown on solid media. These isolates could be O157 or non-O157 STEC.''
*''The “Unknown” category includes STEC infections detected exclusively by CIDT and culture-confirmed STEC infections reported to LEDS without serogroup information. LEDS does not currently collect information on test type and cannot differentiate between these two types of reports.''
*STEC =  Shiga toxin-producing ''Escherichia coli''
*CIDT = Culture-independent diagnostic tests
*LEDS = Laboratory-based Enteric Disease Surveillance
'''Figure 11.''' Incidence of human STEC infection reported to LEDS, by serogroup and year, United States, 1996–2016 (Data retrieved from [https://www.cdc.gov/ecoli/index.html CDC])
=== Final Genome Assembly Pipeline ===
[[File: Final_Genome_Assembly_Pipeline.png|600x600px]]
== '''Conclusion''' ==
In conclusion,
== In-Class Presentations ==
*'''Genome Assembly Background and Strategy:''' [[File: Team_1_Genome_Assembly_Presentation_1.pdf]]
*'''Genome Assembly Final Results:'''
== References ==
1. Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, Glenn Tesler, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8,
15 April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086
2. Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol.
2012;19(5):455–477. doi:10.1089/cmb.2012.0021
3. Butler, Jonathan et al. “ALLPATHS: de novo assembly of whole-genome shotgun microreads.” Genome research vol. 18,5 (2008): 810-20.
doi:10.1101/gr.7337908
4. CDC, (2017) Centers for Disease Control and Prevention, National Center for Emerging and Zoonotic Infectious Diseases (NCEZID), Division of Foodborne, Waterborne, and Environmental Diseases (DFWED)
5. Earl, Dent et al. “Assemblathon 1: a competitive assessment of de novo short read assembly methods.” Genome research vol. 21,12 (2011): 2224-41.
doi:10.1101/gr.126599.111
6. Maccallum, Iain et al. “ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads.” Genome biology vol. 10,10 (2009):
R103. doi:10.1186/gb-2009-10-10-r103
7. Miller, Jason R et al. “Assembly algorithms for next-generation sequencing data.” Genomics vol. 95,6 (2010): 315-27. doi:10.1016/j.ygeno.2010.03.001
8. Pritt, J., Chen, N. & Langmead, B. FORGe: prioritizing variants for graph genomes. Genome Biol 19, 220 (2018). https://doi.org/10.1186/s13059-018-1595-x
9. Quainoo, S., Coolen, J.P., Hijum, S.A., Huynen, M.A., Melchers, W.J., Schaik, W.V., & Wertheim, H.F. (2017). Whole-Genome Sequencing of Bacterial Pathogens:
the Future of Nosocomial Outbreak Analysis. Clinical microbiology reviews, 30 4, 1015-1063 .
10. Rahman, A., Pachter, L. CGAL: computing genome assembly likelihoods. Genome Biol 14, R8 (2013). https://doi.org/10.1186/gb-2013-14-1-r8
11. Salzberg, Steven L et al. “GAGE: A critical evaluation of genome assemblies and assembly algorithms.” Genome research vol. 22,3 (2012): 557-67.
doi:10.1101/gr.131383.111
12. Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages
i884–i890, https://doi.org/10.1093/bioinformatics/bty560
13. Sohn, Jang-il; Nam, Jin-Wu. “The present and future of de novo whole-genome assembly”, Briefings in Bioinformatics, Vol 19.1 (2018).
doi.org/10.1093/bib/bbw096
14. Souvorov A., Agarwala R., & Lipman D.J. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology. 2018; 19(1).
doi:10.1186/s13059-018-1540-z
15. Tanja Magoc, Stephan Pabinger, Stefan Canzar, Xinyue Liu, Qi Su, Daniela Puiu, Luke J. Tallon, Steven L. Salzberg, GAGE-B: an evaluation of genome assemblers
for bacterial organisms, Bioinformatics, Volume 29, Issue 14, 15 July 2013, Pages 1718–1725, https://doi.org/10.1093/bioinformatics/btt273
16. Zerbino, D., & Birney, E. (n.d.). Velvet: de novo assembly using very short reads. Hinxton: European Bioinformatics Institute.

Latest revision as of 11:12, 14 January 2021