Revision as of 17:01, 24 February 2020

Group Members: Deepali Kundnani, Aparna Maddala, Swetha Singu, Yiqiong Xiao, Ruize Yang

Lectures

Team 3 Genome Assembly: Background and Strategy
Team 3 Genome Assembly: Results

Introduction

What is genome assembly? [expand]

Objective

We have 50 paired-end samples sequenced on the Illumina platform. These samples originate from bacterial isolates collected by CDC PulseNet. [expand]

Methods

Pipeline for Testing

Figure 1: Proposed assembly pipeline. Four assembly tools, SPAdes, SKESA, ABySS, and MaSuRCA, were tested to determine the optimal assembler for our data.

Quality Control/Trimming

Tools used:

fastp (https://github.com/OpenGene/fastp)
MultiQC (https://multiqc.info/)

Tool Selection

For quality control, we compared two tools, fastp and FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We first confirmed that the two programs reported the same information when run on identical FASTQ files, and then compared how each tool presents that information in its report. While FastQC creates highly informative per-base sequence quality graphs, it runs considerably slower than fastp. The charts generated by fastp gave us the biologically relevant information we needed, so we decided to use fastp for quality control.

Afterwards, we compared fastp's trimming features with those of Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic). Our data did not contain adapters, so we did not need an additional tool like CutAdapt (https://cutadapt.readthedocs.io/en/stable/) to remove them. We showed that most, if not all, of Trimmomatic's trimming features can be replicated in fastp. Furthermore, fastp contains a feature specifically for paired-end data in which it uses a high-quality read to correct low-confidence bases in its mate. An added advantage of fastp is that it combines quality control and trimming into a single step, which increases the speed and usability of our pipeline. Because we had 50 samples in our pipeline, we used MultiQC to consolidate the 50 separate quality control reports generated by fastp into a single report.

Pre-Trim Quality Control

Figure 2: Pre-trim quality control curves of our 50 samples. Plot generated using fastp and MultiQC.

Read 1 was of high quality: all 50 read 1 samples passed the mean quality score threshold of 28. Read 2 was of lower quality, with only 25 samples passing the same threshold, mostly because of low-quality regions at the 5' end of read 2. Because of this notable difference in quality between the two reads, we decided to trim read 1 and read 2 using different parameters. According to a previous study (https://www.nature.com/articles/s41598-019-39076-7), trimming does not have a clear effect on assembly quality but speeds up most assembly implementations. To pass high-quality data to the assemblers, we wanted to trim the low-quality regions at the 3' and 5' ends of each read while retaining the high-quality middle regions.
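To make the threshold of 28 concrete, here is a minimal sketch of how a mean quality score can be computed from a FASTQ quality line. It assumes Phred+33 encoding (the usual encoding for current Illumina output) and uses a plain arithmetic mean over one read. The function names are hypothetical, and this is an illustration rather than the exact calculation that fastp or MultiQC performs.

```python
def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores encoded in one FASTQ quality line.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def passes_threshold(quality_string: str, threshold: float = 28.0) -> bool:
    """True if the mean Phred score of the read is at least the threshold."""
    return mean_phred(quality_string) >= threshold


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    for name, qual in [("high", "IIIIIIII"), ("low", "55555555")]:
        print(name, mean_phred(qual), passes_threshold(qual))
```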

Trimming Parameters

The following arguments were supplied to fastp in order to trim our data: -f 5 -F 30 -t 10 -e 28 -c -5 3 -M 27

-f 5 - globally trim 5 bases from front of mate 1
-F 30 - globally trim 30 bases from front of mate 2
-t 10 - trim 10 bases from end of both mates
-e 28 - discard reads with an average quality score under 28
-c - turns on paired-end base correction, which slightly increased the quality of the 3' end of mate 2
-5 3 - turns on sliding window trimming from the 5' end with a window size of 3; the bases in a window are discarded when the window's mean quality falls below the threshold
-M 27 - quality threshold of 27 for sliding window
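The arguments above could be applied to each sample with a small wrapper such as the sketch below. The trimming arguments are copied verbatim from the list above. Everything else is an assumption made for illustration: the sample name, the directory names, the `_1.fq.gz`/`_2.fq.gz` file naming, and the helper function names. The input, output, and report options (`-i`, `-I`, `-o`, `-O`, `-j`, `-h`) are standard fastp options, and `multiqc <directory> -o <output directory>` is the usual way to consolidate reports.

```python
import shlex
from pathlib import Path

# Trimming arguments copied verbatim from the list above.
TRIM_ARGS = ["-f", "5", "-F", "30", "-t", "10", "-e", "28",
             "-c", "-5", "3", "-M", "27"]


def fastp_command(sample, in_dir, out_dir):
    """Build the fastp argument list for one paired-end sample.

    The file naming scheme is hypothetical; adjust it to the real data.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return [
        "fastp",
        "-i", str(in_dir / f"{sample}_1.fq.gz"),           # read 1 input
        "-I", str(in_dir / f"{sample}_2.fq.gz"),           # read 2 input
        "-o", str(out_dir / f"{sample}_1.trimmed.fq.gz"),  # read 1 output
        "-O", str(out_dir / f"{sample}_2.trimmed.fq.gz"),  # read 2 output
        "-j", str(out_dir / f"{sample}.fastp.json"),       # JSON report
        "-h", str(out_dir / f"{sample}.fastp.html"),       # HTML report
        *TRIM_ARGS,
    ]


def multiqc_command(report_dir, out_dir):
    """Build the MultiQC call that merges all reports found in report_dir."""
    return ["multiqc", str(report_dir), "-o", str(out_dir)]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    print(shlex.join(fastp_command("sample01", "raw", "trimmed")))
    print(shlex.join(multiqc_command("trimmed", "multiqc_report")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths match the real layout. Building argument lists rather than shell strings avoids quoting problems with unusual file names.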

Assembly

Reference-based vs. de novo assembly

Parameters considered when selecting assembly tool

[namedrop the two comparative studies we referenced here]

Tools considered

ABySS

MaSuRCA

SPAdes

SKESA

Methods for testing assembly tools

Results

Post-Trimming Quality Control

Figure 3: Post-trim quality control curves of our 50 samples, generated using fastp and MultiQC.

Figure 4: Percentage of reads in each sample that passed the filters, generated using fastp and MultiQC.
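The percentage plotted in Figure 4 can also be recomputed per sample from the JSON report that fastp writes. The sketch below assumes the report contains `summary.before_filtering.total_reads` and `filtering_result.passed_filter_reads`; this should be checked against the reports produced by the fastp version in use. The function name is hypothetical and the numbers in the example are invented purely to show the arithmetic; they are not our results.

```python
import json


def percent_passed(report: dict) -> float:
    """Percentage of reads that passed fastp's filters.

    `report` is a parsed fastp JSON report; the key names are assumptions
    about its layout.
    """
    total = report["summary"]["before_filtering"]["total_reads"]
    passed = report["filtering_result"]["passed_filter_reads"]
    if total == 0:
        raise ValueError("report contains no reads")
    return 100.0 * passed / total


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    example = json.loads(
        '{"summary": {"before_filtering": {"total_reads": 2000}},'
        ' "filtering_result": {"passed_filter_reads": 1800}}'
    )
    print(f"{percent_passed(example):.1f}% of reads passed")
```

For a real sample, the dictionary would come from `json.load` on that sample's fastp JSON file.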

provide some summary statistics saying that we didn't trim too much

Post-Assembly Validation

quast

BUSCO

References

Team III Genome Assembly Group: Difference between revisions
