Team I Webserver Group
Members: Devishi Kesar, Shuheng Gan, Winnie Zheng, Priya Narayanan, Aaron Pfennig
Introduction
Here is our final presentation for Webserver: File:Web-server presentation.pdf
Background
The primary purpose of our team is to develop a pipeline that analyzes unassembled Escherichia coli (E. coli) sequence reads from 50 raw datasets in order to predict pathogenicity and identify the closest related strain. To reach this goal, we use computational genomics tools from four stages: Genome Assembly, Gene Prediction, Functional Annotation, and Comparative Genomics. We therefore developed a web-server that not only completes this main goal for our specific sequence reads but also handles more general sequences faster and more conveniently, by combining those tools in one pipeline instead of running them separately. In other words, this web-server analyzes sequences in order to predict pathogenicity and visualize the closest related strain in a more convenient, faster, and more accurate way.
Objective
- Provide a comprehensive, automated platform to analyze E. coli isolates in order to predict virulence factors and outbreak clusters
- Functionalities of the webserver:
- Identify virulence factors/antimicrobial resistance and support outbreak response for provided isolates
- Allow data upload at each step of the outlined pipeline
- Visualize findings in a comprehensible way
- Design
- Intuitive usage
- Provide only essential options
WebServer
- Structure
- In order to build a functional web-server, we need to construct the front end and the back end separately.
- Front End: Everything involved with what the user sees (web browser).
- Back End: How the site works, updates, and changes.
- Access to Webserver
Here is the link to access our webserver: echo-Team1_webserver
Functionalities
Genome Assembly
- Performs de novo assembly with FastQ files as input
- Runs the following tools by default:
- fastp: read pre-processing
- Unicycler: Genome assembly
- Options:
- Perform read preprocessing
- k-mer size
- SPAdes as alternative assembly method
- The input FastQ files must be paired-end reads
- Outputs a FASTA file
- Visualisation: QUAST output
- For more details, visit: Team1_Genome_Assembly
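The default and optional steps listed above can be sketched as a small command builder. This is only an illustration under stated assumptions: the function name `build_assembly_commands`, the output file names, and the directory layout are hypothetical and not taken from the webserver's code. The flags used (`fastp -i/-I/-o/-O`, `unicycler -1/-2/-o/--kmers`, `spades.py -1/-2/-o/-k`) are the tools' standard paired-end options.

```python
from pathlib import Path


def build_assembly_commands(reads_1, reads_2, out_dir, preprocess=True,
                            kmers=None, use_spades=False):
    """Return the command lines (as argument lists) for one paired-end sample.

    All names and paths here are hypothetical; nothing is executed.
    """
    out = Path(out_dir)
    commands = []

    if preprocess:
        # Optional read pre-processing with fastp (paired-end in/out).
        trimmed_1 = out / "trimmed_R1.fastq.gz"
        trimmed_2 = out / "trimmed_R2.fastq.gz"
        commands.append(["fastp", "-i", str(reads_1), "-I", str(reads_2),
                         "-o", str(trimmed_1), "-O", str(trimmed_2)])
        reads_1, reads_2 = trimmed_1, trimmed_2

    if use_spades:
        # SPAdes as the alternative assembler; -k takes a comma-separated list.
        cmd = ["spades.py", "-1", str(reads_1), "-2", str(reads_2),
               "-o", str(out / "spades")]
        kmer_flag = "-k"
    else:
        # Unicycler is the default assembler.
        cmd = ["unicycler", "-1", str(reads_1), "-2", str(reads_2),
               "-o", str(out / "unicycler")]
        kmer_flag = "--kmers"

    if kmers:
        cmd += [kmer_flag, ",".join(str(k) for k in kmers)]

    commands.append(cmd)
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)`; building the commands separately from running them makes the option handling easy to check.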
Gene Prediction
- Gene finding in assembled isolates or a provided FASTA file
- Takes FastQ files as input
- Runs the following tools by default:
- CDS: Prodigal
- tRNA: Aragorn
- rRNA: barrnap
- Options:
- GeneMarkS-2 as alternative tool for CDS predictions
- tRNAscan-SE as alternative tool for tRNA predictions
- RNAmmer as alternative tool for rRNA predictions
- Outputs a *.gff file, a *_cds.fna file, a *_protein.faa file, and a *_rna.fna file
- For more details, visit: Team1_Gene_Prediction
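As a quick sanity check on the *.gff output, the predicted features can be tallied by type. The sketch below assumes the file follows the standard nine-column, tab-separated GFF layout, with the feature type in the third column. The function name `count_gff_features` is hypothetical, and the exact feature types written by the pipeline are not specified here.

```python
from collections import Counter


def count_gff_features(lines):
    """Count features per type (e.g. CDS, tRNA, rRNA) in GFF-formatted lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            # An embedded sequence section follows; no more feature lines.
            break
        if not line or line.startswith("#"):
            # Skip blank lines, comments, and directives.
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            # Not a complete nine-column feature line; ignore it.
            continue
        counts[fields[2]] += 1
    return counts
```

Because the function accepts any iterable of lines, it works on an open file handle, for example `with open("isolate.gff") as handle: counts = count_gff_features(handle)`, where `isolate.gff` is a placeholder file name.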
Functional Annotation
- Obtain functional information about predicted genes
- Input: FASTA file
- Cluster Tool: USEARCH
- Output: centroid.fasta
- Homology Tools:
- General annotation: InterProScan, eggNOG-mapper
- Antibiotic resistance genes: DeepARG
- Ab initio Tools:
- Signal Peptides: SignalP 5.0
- Transmembrane Proteins: TMHMM
- CRISPR Sites: PILER-CR
- Output: *.tsv file
- For more details, visit: Team1_Functional_Annotation
Comparative Genomics
- Comparison of genomic features of input files to identify outbreak clusters
- Input: FASTA file, Prodigal training file (for chewBBACA)
- Tools used:
- MUMmer 4.0
- chewBBACA
- kSNP 3.0
- FigTree
- Options:
- Parsimony, maximum likelihood, and neighbour-joining trees as options for kSNP
- k-mer size option for kSNP
- Output: .tsv file (for chewBBACA, MUMmer), .pdf (kSNP)
- Visualisation: Phylogenetic tree for identified SNPs, phylogenetic tree for MLST, graph for epidemiological data visualisation
- For more details, visit: Team1_Comparative_Genomics
Method
- WebServer Building
- We built the Apache webserver by installing Apache and mod_wsgi
- Apache (Apache_HTTP_Server): Apache is free and open-source cross-platform web-server software that supports many different modules.
- mod_wsgi (Mod_wsgi): Among those modules, we chose mod_wsgi, which provides a WSGI-compliant interface for hosting Python-based web applications (both Python 2 and 3) under Apache.
- Data visualization
- We generate the final result pages using basic CSS and HTML
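To illustrate what mod_wsgi hosts, here is a minimal WSGI application. It is a sketch only, not the webserver's actual code: the page content and the single route are placeholders. mod_wsgi looks for a callable named `application` in the script file by default, and the callable follows the standard WSGI signature.

```python
def application(environ, start_response):
    """Minimal WSGI entry point: serve a placeholder page at "/"."""
    path = environ.get("PATH_INFO", "/")
    if path == "/":
        status = "200 OK"
        body = b"<html><body><h1>Webserver is running</h1></body></html>"
    else:
        status = "404 Not Found"
        body = b"<html><body><h1>Not found</h1></body></html>"

    # WSGI requires header values as strings and the body as bytes.
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

Under Apache, a script like this is typically mapped to a URL with the `WSGIScriptAlias` directive. Frameworks such as Flask or Django expose the same kind of callable, so the hosting setup stays the same if the application grows.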
Webserver Demo
Choice One: Running General Pipeline
The user can process raw genome sequence data through all four stages of analysis and interpretation in one click.
- Click Analyze and choose General Pipeline.
- Upload a compressed folder and, optionally, metadata here (check the file types carefully; metadata is optional and cannot replace the compressed folder).
- Enter the email address where you want to receive the final result image and datasets.
- Click "RUN"; the user will be notified by email.
Choice Two: Running each step separately
The user can process raw genome sequence data in each of the four stages of analysis and interpretation separately, which allows them to choose the tools they prefer and obtain the results of each stage on its own.
- Click Analyze and choose the tool you want to use.
- Upload a compressed folder and, optionally, metadata here (check the file types carefully; metadata is optional and cannot replace the compressed folder).
- Set each option according to your requirements.
- Enter the email address where you want to receive the result image and datasets for each tool.
- Click "RUN"; the user will be notified by email.
Results
- We analyzed 50 E. coli isolates, together with epidemiological data, in the context of a foodborne outbreak. The data were analyzed using this web server, and the final results are presented below. The analysis consisted of de novo genome assembly, gene prediction, functional annotation, and comparative genomics. We paid special attention to the virulence factors, possible food sources, and the outbreak location.
- First, we chose one of our isolates as the reference and determined the Average Nucleotide Identity (ANI) of all other isolates with respect to it. Three isolates have a relatively low ANI of approximately 84%. Three other isolates have an ANI between 97% and 98%, signifying some differences in regions of the genome with respect to the reference. The remaining isolates are closely related to the reference genome, with an ANI of approximately 99%. The ANI was determined using MUMmer 4.0; this approach has low resolution and does not resolve finer differences between highly similar genomes. The plot below shows the ANIs with respect to the reference genome:
- Subsequently, we performed MLST analysis using chewBBACA to create a schema and call alleles on the assembled genomes of the 50 isolates. In this way, clusters of outbreak isolates were identified. These preliminary results, combined with the epidemiological data, allowed us to narrow down the outbreak locations. The results were visualized using GrapeTree and are shown below:
- In our strain analysis, we incorporated the functional annotation results. They supported the results of the MLST analysis and pointed to three food sources: melons, chorizo, and bananas, all of which are served at certain brunch places. Furthermore, we asked whether these strains, which are clearly genetically related, can be treated in a similar fashion. We therefore analyzed the strains using DeepARG, and (fortunately) they are identical on this basis and vulnerable to phenicol and sulfonamides. The insights gained from the MLST analysis, strain analysis, and epidemiological data are depicted below:
- In addition, SNP analysis was performed using kSNP 3.0. The optimal k-mer size was determined using Kchooser, which also reports the fraction of k-mers that are present in all genomes (FCK). The FCK value is a measure of sequence diversity and hence of relatedness: the lower the FCK, the more diverse and the more distantly related the genomes are. The FCK for our isolates is 0.422. Studies have shown that when FCK is ≥ 0.1, SNP detection efficiency is adequate and the accuracy of parsimony trees estimated by kSNP3 is > 97%; i.e., the trees can be considered reliable. The tree is shown below:
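The idea behind the FCK can be illustrated with a toy calculation. The sketch below is not Kchooser's implementation, and its exact procedure may differ. It assumes the denominator is the set of all distinct k-mers found in any genome, and it ignores reverse complements for simplicity. The function names are hypothetical.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def fraction_core_kmers(genomes, k):
    """Fraction of distinct k-mers that occur in every genome.

    Assumption: the denominator is the union of k-mers over all genomes.
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    sets = [kmer_set(genome.upper(), k) for genome in genomes]
    core = set.intersection(*sets)   # k-mers shared by all genomes
    union = set.union(*sets)         # k-mers seen in any genome
    return len(core) / len(union) if union else 0.0
```

For example, the toy sequences `"ACGTAC"` and `"ACGTTT"` with `k=3` share two of six distinct 3-mers, giving a value of 1/3; identical genomes give 1.0, and the value drops as the genomes diverge.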
References
1. Maiden MC, Jansen van Rensburg MJ, Bray JE, et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol. 2013;11(10):728-36.
2. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, et al. (2018) MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14(1): e1005944. https://doi.org/10.1371/journal.pcbi.1005944
3. Perez-Losada M, Arenas M, Castro-Nallar E. Microbial sequence typing in the genomic era. Infection, Genetics and Evolution. 2018;63:346-359. http://dx.doi.org/10.1016/j.meegid.2017.09.022
4. Strockbine N, Bopp C, Fields P, Kaper J, Nataro J. 2015. Escherichia, Shigella, and Salmonella, p 685-713. In Jorgensen J, Pfaller M, Carroll K, Funke G, Landry M, Richter S, Warnock D (ed), Manual of Clinical Microbiology, Eleventh Edition. ASM Press, Washington, DC. doi: 10.1128/9781555817381.ch37
5. Sultan, I., Rahman, S., Jan, A. T., Siddiqui, M. T., Mondal, A. H., & Haq, Q. M. R. (2018). Antibiotics, Resistome and Resistance Mechanisms: A Bacterial Perspective. Frontiers in Microbiology, 9(2066). doi:10.3389/fmicb.2018.02066
6. Trees E, Rota P, Maccannell D, Gerner-Smidt P. 2015. Molecular Epidemiology, p 131-159. In Jorgensen J, Pfaller M, Carroll K, Funke G, Landry M, Richter S, Warnock D (ed), Manual of Clinical Microbiology, Eleventh Edition. ASM Press, Washington, DC. doi: 10.1128/9781555817381.ch10
7. Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166
8. Z Zhou, NF Alikhan, MJ Sergeant, N Luhmann, C Vaz, AP Francisco, JA Carrico, M Achtman (2018) "GrapeTree: Visualization of core genomic relationships among 100,000 bacterial pathogens", Genome Res; doi: https://doi.org/10.1101/gr.232397.117