Team II Genome Assembly Group: Difference between revisions

From Compgenomics 2020

Revision as of 16:35, 29 January 2020

Team 2 Genome Assembly

Introduction

In-Class Presentations

File:Presentation 1.pdf

Pipeline Overview

Quality Control and Trimming

Before de novo assembly could begin, the input FASTQ files needed to be evaluated and processed for quality. Low-quality reads and adapters then required trimming to ensure a more accurate genome assembly. The tool used for pre-trimming quality control was FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Following installation, FastQC was run on the 100 provided FASTQ files, and the location of each data summary text file was recorded. The summary files were iterated over, and per-base sequence quality was compared to a threshold of 30. The first position to pass the threshold was selected as the head-cropping point for trimming, and the final position to pass the threshold was selected as the tail-cropping point. Trimming was then completed with fastp (https://github.com/OpenGene/fastp).
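The cropping-point selection described above can be sketched as follows. This is a minimal illustration, not the group's actual script: it assumes the summary is FastQC's tab-separated fastqc_data.txt, that the mean quality column is the value compared against the threshold, and that later positions may be binned into ranges such as "4-5". The function names and the sample report are hypothetical.

```python
# Sketch: find head and tail cropping points from FastQC per-base quality.
# Assumption: the *mean* quality column is compared to the threshold.

def parse_per_base_quality(fastqc_text):
    """Return (start, end, mean_quality) rows from the per-base quality module."""
    rows = []
    in_module = False
    for line in fastqc_text.splitlines():
        if line.startswith(">>Per base sequence quality"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        if line.startswith("#") or not line.strip():
            continue  # skip the column header and blank lines
        fields = line.split("\t")
        # Positions may be single ("3") or binned ("4-5").
        start, _, end = fields[0].partition("-")
        rows.append((int(start), int(end or start), float(fields[1])))
    return rows


def crop_points(rows, threshold=30.0):
    """Return (first passing position, last passing position), or None."""
    passing = [(start, end) for start, end, mean in rows if mean >= threshold]
    if not passing:
        return None
    return passing[0][0], passing[-1][1]


# Hypothetical, heavily shortened fastqc_data.txt content for illustration.
SAMPLE_REPORT = "\n".join([
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    ">>END_MODULE",
    ">>Per base sequence quality\tfail",
    "#Base\tMean\tMedian",
    "1\t25.0\t26.0",
    "2\t31.0\t32.0",
    "3\t34.0\t35.0",
    "4-5\t33.0\t34.0",
    "6-7\t28.0\t29.0",
    ">>END_MODULE",
])

head, tail = crop_points(parse_per_base_quality(SAMPLE_REPORT))
```

In this sample, positions 2 through 5 pass the threshold, so `head` is 2 and `tail` is 5. Taking the first and last passing positions keeps the whole span between them, even if an interior position dips below the threshold.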

Before fastp was selected as the trimming tool, it was compared to Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic), another widely used trimming tool. In this comparison, fastp was substantially faster than Trimmomatic and also offered two major features that, when implemented, reduced the complexity and runtime of this process. First, fastp offers automatic adapter detection; because 50 paired-end read sets were passed through the pipeline, this feature allowed adapter trimming to be tailored to each FASTQ file. Second, the output of each fastp run includes its own post-trim QC HTML and JSON files, negating the need for a subsequent FastQC run.
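A paired-end fastp call using these two features might be assembled as in the sketch below. The helper name, file names, and output naming scheme are hypothetical, and the exact options the group used are not stated here; the flags shown (--in1/--in2, --out1/--out2, --detect_adapter_for_pe, --trim_front1/2, --trim_tail1/2, --json, --html) are standard fastp options. The function only builds the argument list so it can be inspected before anything is run.

```python
# Sketch: build a fastp command line for one paired-end sample.
# Assumption: the same front/tail trim counts are applied to both mates.

def build_fastp_command(read1, read2, out_prefix, front_trim=0, tail_trim=0):
    """Return the fastp argument list for one pair of FASTQ files."""
    cmd = [
        "fastp",
        "--in1", read1,
        "--in2", read2,
        "--out1", f"{out_prefix}_1.trimmed.fastq.gz",
        "--out2", f"{out_prefix}_2.trimmed.fastq.gz",
        # Enable adapter sequence auto-detection for paired-end input.
        "--detect_adapter_for_pe",
        # fastp writes its own QC reports alongside the trimmed reads.
        "--json", f"{out_prefix}.fastp.json",
        "--html", f"{out_prefix}.fastp.html",
    ]
    if front_trim > 0:
        # Number of bases removed from the 5' end of each mate.
        cmd += ["--trim_front1", str(front_trim), "--trim_front2", str(front_trim)]
    if tail_trim > 0:
        # Number of bases removed from the 3' end of each mate.
        cmd += ["--trim_tail1", str(tail_trim), "--trim_tail2", str(tail_trim)]
    return cmd


# Hypothetical sample: trim 1 base from the front and 2 from the tail.
example_cmd = build_fastp_command(
    "sample01_1.fq.gz", "sample01_2.fq.gz", "sample01", front_trim=1, tail_trim=2
)
```

The resulting list can be passed to `subprocess.run(example_cmd, check=True)` on a machine where fastp is installed. Note that fastp's trim options take a count of bases to remove, so a head-cropping position p corresponds to a front trim of p - 1, and a tail-cropping position q on reads of length n corresponds to a tail trim of n - q.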

The 50 output HTML files were then combined using MultiQC (https://multiqc.info/), and the following parameters were found to optimize post-trim quality for all 50 inputs.

de Novo Assembly

Post-Assembly QC

Species Identification

Works Cited

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (2016) doi: 10.1093/bioinformatics/btw354 PMID: 27312411

Bankevich, Anton et al. “SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.” Journal of computational biology : a journal of computational molecular cell biology vol. 19,5 (2012): 455-77. doi:10.1089/cmb.2012.0021

https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/

https://bpa-csiro-workshops.github.io/btp-manuals-md/modules/btp-module-velvet/velvet/

ncbi.nlm.nih.gov/pmc/articles/PMC2952100/

Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072-1075.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.