Team III Genome Assembly Group

From Compgenomics 2020

Introduction/Background

[put picture of final pipeline here]

Lectures

Quality Control/Trimming

Tools used:

Selection of Tools

For quality control, we compared two tools, fastp and FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We first verified that the two programs report the same information when run on identical FASTQ files, and then compared how each presents that information in its reports. FastQC produces highly informative per-base sequence quality graphs, but it runs significantly slower than fastp. Because the charts generated by fastp gave us the biologically relevant information we needed, we chose fastp for quality control.

Next, we compared fastp's trimming features with those of Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic). Our data did not contain adapters, so we did not need an additional tool such as Cutadapt (https://cutadapt.readthedocs.io/en/stable/) to remove them. We found that most, if not all, of Trimmomatic's trimming features can be replicated in fastp. Furthermore, fastp has a feature specific to paired-end data: where the two mates overlap, it can use a high-quality base in one read to correct a low-confidence base in its mate. fastp also combines quality control and trimming into a single step, which makes our pipeline faster and easier to use. Because our pipeline had 50 input files, we used MultiQC to consolidate the 50 separate quality control reports generated by fastp into a single report.
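As a rough illustration of the single-step workflow described above, the sketch below builds the fastp command for one paired-end sample and the MultiQC command that consolidates the reports. The file naming scheme (`_1.fq.gz`, `_2.fq.gz`), the directory names, and the helper names `fastp_command` and `multiqc_command` are assumptions made for this example and are not taken from our pipeline.

```python
from pathlib import Path


def fastp_command(sample, reads_dir, out_dir):
    """Build the fastp argument list for one paired-end sample.

    The input and output file names are hypothetical; adjust them to the
    naming scheme of the actual data.
    """
    reads_dir, out_dir = Path(reads_dir), Path(out_dir)
    return [
        "fastp",
        "-i", str(reads_dir / f"{sample}_1.fq.gz"),        # mate 1 input
        "-I", str(reads_dir / f"{sample}_2.fq.gz"),        # mate 2 input
        "-o", str(out_dir / f"{sample}_1.trimmed.fq.gz"),  # mate 1 output
        "-O", str(out_dir / f"{sample}_2.trimmed.fq.gz"),  # mate 2 output
        "-j", str(out_dir / f"{sample}.fastp.json"),       # JSON report
        "-h", str(out_dir / f"{sample}.fastp.html"),       # HTML report
    ]


def multiqc_command(report_dir, out_dir):
    """Build the MultiQC call that scans report_dir and writes to out_dir."""
    return ["multiqc", str(report_dir), "-o", str(out_dir)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, once per sample for fastp and once at the end for MultiQC.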

Figure 2: Pre-trim quality control curves of our 50 samples, generated using fastp and MultiQC.

Pre-Trim Quality Control

  • Mate 1 is high quality and mate 2 is lower quality, so mate 1 needs only conservative trimming
  • The worst region is at the very beginning
  • Desired result: cut off the low-quality regions on both sides while retaining the high-quality middle regions

Trimming Parameters

One study (https://www.nature.com/articles/s41598-019-39076-7) found that trimming does not have a clear effect on assembly quality but speeds up most assembly implementations. Therefore, we decided to trim our data to remove low-quality regions, especially the regions on the 5' end of read 2. The following arguments were supplied to fastp to trim our data: -f 5 -F 30 -t 10 -e 28 -c -5 3 -M 27.

  • -f 5 - globally trim 5 bases from front of mate 1
  • -F 30 - globally trim 30 bases from front of mate 2
  • -t 10 - trim 10 bases from end of both mates
  • -e 28 - discard reads with an average quality score under 28
  • -c - turns on paired-end base correction (has a small effect on quality at ends of mate 2)
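  • -5 3 - turns on sliding window trimming from the 5' end: the window moves toward the 3' end, and bases are dropped while the window's mean quality is below the threshold; intended window size of 3 (note: the fastp documentation lists -5 as a flag that takes no value, with the window size set by -W, so this should be checked against the fastp version used)
  • -M 27 - mean quality threshold of 27 for the sliding window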
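To make the effect of these options concrete, here is a toy sketch that applies the same kinds of steps to a list of per-base quality scores. It is a simplified illustration and not fastp's actual implementation; the function names and the example scores are invented for this sketch.

```python
def global_trim(quals, front, tail):
    """Drop a fixed number of bases from the front and tail (like -f/-F and -t)."""
    end = len(quals) - tail
    return quals[front:end] if end > front else []


def cut_front(quals, window, min_mean):
    """Simplified 5' sliding-window cut.

    Advance the window one base at a time from the 5' end and keep
    everything from the first window whose mean quality reaches min_mean.
    """
    start = 0
    while start + window <= len(quals):
        if sum(quals[start:start + window]) / window >= min_mean:
            return quals[start:]
        start += 1
    return []  # no window reached the threshold


def passes_average(quals, min_avg):
    """Keep a read only if its average quality is at least min_avg (like -e)."""
    return bool(quals) and sum(quals) / len(quals) >= min_avg


# Invented quality scores for one short read (low quality at both ends).
read = [12, 15, 20, 22, 26, 30, 34, 36, 36, 35, 30, 20]
trimmed = global_trim(read, front=2, tail=1)
trimmed = cut_front(trimmed, window=3, min_mean=27)
keep = passes_average(trimmed, min_avg=28)
```

In this toy read, the global trim removes the two worst leading bases and the last base, the sliding window then removes two more bases at the 5' end, and the remaining read passes the average-quality filter.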

Post-Trimming Quality Control

Figure 3: Post-trim quality control curves of our 50 samples, generated using fastp and MultiQC.
Figure 4: Percentage of reads in each sample that passed the filters, generated using fastp and MultiQC.

[provide summary statistics showing that we did not trim too much]
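One possible way to obtain such statistics is to read the per-sample JSON reports that fastp writes and compute the share of reads that survived filtering. The sketch below assumes the reports contain `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`; these key names should be verified against a real report. The function names are hypothetical.

```python
import json
from pathlib import Path


def pass_rate(report):
    """Percentage of reads kept, given one parsed fastp JSON report.

    Assumes the report has summary.before_filtering.total_reads and
    summary.after_filtering.total_reads.
    """
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return 100.0 * after / before if before else 0.0


def pass_rates(report_dir):
    """Map each JSON report file name in report_dir to its pass rate."""
    return {
        path.name: pass_rate(json.loads(path.read_text()))
        for path in sorted(Path(report_dir).glob("*.json"))
    }
```

Calling `pass_rates` on the directory holding the 50 fastp reports would give one percentage per sample, which can then be summarized (minimum, median, maximum) to check how much data trimming removed.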

Assembly Tools

Reference-based vs. de novo assembly

Tools considered

ABySS

MaSuRCA

SPAdes

SKESA

Parameters considered when selecting assembly tool

Methods for testing assembly tools

Post-Assembly Validation

QUAST

BUSCO

References